```r
# data handling
library(curl)
library(tidyverse)
# discriminant analysis
library(MASS)
# data visualization
library(ggplot2)
library(plotly)
library(geometry)
library(mulgar)
library(ggpubr)
```
Discriminant analysis is a type of statistical analysis used to predict the probability of belonging to a given class or category based on one or more predictor variables. We will be covering two different types of discriminant analysis: linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
Linear discriminant analysis (LDA) is used to find a linear combination of features that separates different classes in a data set. LDA works by calculating the means and covariance matrices for each data class, assuming a common covariance matrix across all classes. When effectively utilized, LDA finds the linear combination of variables that provides the best possible separation between the groups.
LDA operates by projecting a feature space (a data set of \(n\) dimensions) onto a smaller space of dimension \(g\), where \(g \le n-1\), without losing class-discriminating information.

Each class is summarized by a mean vector and a covariance matrix, which measures how each variable or feature relates to the others within that class.
An LDA model comprises the statistical properties that are calculated for the data in each class - where there are multiple variables, these properties are calculated over the multivariate Gaussian (normal) distribution. The model assumes the following:
- the input data set has a normal distribution
- the data set is linearly separable, meaning LDA can draw a decision boundary that separates the data points
- each class has the same covariance matrix
Class separation is quantified by the between-group sum-of-squares matrix:
\[ B = \sum_{k=1}^{g}n_k \, (\bar{X}_k - \bar{X})(\bar{X}_k - \bar{X})^\top \]
This measures the differences between the class means, \(\bar{X}_k\), and the mean of the whole data set, \(\bar{X}\). It is paired with the within-group sum-of-squares matrix,
\[ W = \sum_{k = 1}^g \sum_{i = 1}^{n_k} \, (X_{ki} - \bar{X}_k)(X_{ki} - \bar{X}_k)^\top \]
which measures the variation of the observations \(X_{ki}\) around their class mean, \(\bar{X}_k\).
The linear discriminant space is generated by computing the eigenvectors of \(W^{-1}B\); this is the \(g\)-dimensional space where the group means are most separated with respect to the pooled covariance. For each class, we compute:
\[ \delta_k(x) = x^\top W^{-1} \mu_k - \frac{1}{2}\mu_k^\top W^{-1} \mu_k + \log\pi_k \]
where \(\pi_k\) is a prior probability for class \(k\). The LDA classifier rule is to assign a new observation to the class with the largest value of \(\delta_k(x)\).
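To make the rule concrete, here is a minimal sketch of the classifier in R. All of the numbers (the two class means, the identity pooled covariance, the equal priors, and the observation `x_new`) are made up for illustration and are not the data used below:

```r
# Toy two-class LDA classifier rule (all values are illustrative)
mu   <- list(k1 = c(2, 0), k2 = c(0, 2))  # class means
W    <- diag(2)                           # pooled covariance (identity for simplicity)
pi_k <- c(k1 = 0.5, k2 = 0.5)             # equal prior probabilities

x_new <- c(1.5, 0.5)                      # observation to classify
W_inv <- solve(W)

# delta_k(x) = x' W^-1 mu_k - 0.5 mu_k' W^-1 mu_k + log(pi_k)
delta <- sapply(names(mu), function(k) {
  drop(x_new %*% W_inv %*% mu[[k]]) -
    0.5 * drop(mu[[k]] %*% W_inv %*% mu[[k]]) +
    log(pi_k[[k]])
})
names(which.max(delta))  # assign x_new to the class with the largest delta_k
```

Here \(\delta_{k1} \approx 0.31\) and \(\delta_{k2} \approx -1.69\), so the rule assigns `x_new` to class `k1`, which matches the intuition that `x_new` lies closer to that class mean.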
Let's start by simulating some data. Linear discriminant analysis works best with normally distributed, continuous variables with low collinearity. Considering this, let's imagine a study where a geneticist wants to classify a monkey into one of three genotypes based on the transcription levels of certain genes:
```r
# First, let's set a seed for reproducibility,
set.seed(647)
# then set n as the number of observations to simulate.
n <- 300
# Now we can simulate the transcription frequencies.
# Let's assume that the classification is set up as such:
# Genotype B (geno = 0): highest expression of gene X
# Genotype A (geno = 1): intermediate expression of genes X and Y
# Genotype C (geno = 2): highest expression of genes Y and Z
geno <- sample(c(0, 1, 2), size = n, replace = TRUE)
# now the gene expressions
gene_x <- rnorm(n, mean = 5 - (2 * geno), sd = 1)
# this will produce a distribution that is highest in genotype B individuals
gene_y <- rnorm(n, mean = 5 + (2 * geno), sd = 1)
# this will produce a distribution that increases with geno, highest in genotype C
gene_z <- rnorm(n, mean = 5 + (2 * (geno == 2)), sd = 1)
# this will produce a distribution that is elevated only in genotype C individuals
# Now, let's store this simulated data
lda_data <- data.frame(
  gene_x = gene_x,
  gene_y = gene_y,
  gene_z = gene_z,
  geno = factor(geno, labels = c("B", "A", "C"))
)
# and lastly, check the first few rows to make sure everything is right
head(lda_data)
```
| gene_x | gene_y | gene_z | geno |
|---|---|---|---|
| 4.200588 | 4.518938 | 2.835381 | B |
| 3.933105 | 7.303336 | 5.784455 | A |
| 4.971770 | 5.305419 | 5.695243 | B |
| 4.231134 | 6.795563 | 4.026929 | A |
| 4.881066 | 5.391924 | 6.790874 | B |
| 1.117153 | 8.388053 | 7.003496 | C |
Linear discriminant analysis reduces the number of dimensions by creating a new axis while simultaneously considering two criteria: maximizing the distance between the class means, and minimizing the scatter within each class.
When we have two classes, we can accomplish this by expressing the ratio:
\[ \frac{(\mu_1 - \mu_2)^2}{s^2_1 + s^2_2} = \frac{d^2}{s^2_1 + s^2_2} \]
For an example, let's start by calculating the values we need. The means are easy, but we have to make sure to distinguish between genotype and gene.
```r
mu1 <- colMeans(lda_data[lda_data$geno == "A", 1:2])
mu2 <- colMeans(lda_data[lda_data$geno == "B", 1:2])
```
Things get a little tricky when it comes to spread. Because LDA assumes that all the data share a covariance matrix (a.k.a. the pooled covariance matrix), we can't just use the spread of the raw data. The pooled covariance matrix is computed internally by `lda()`, but it can be computed manually like this:
```r
S1 <- cov(lda_data[lda_data$geno == "A", 1:2])
S2 <- cov(lda_data[lda_data$geno == "B", 1:2])
n1 <- sum(lda_data$geno == "A")
n2 <- sum(lda_data$geno == "B")
cov_pooled <- ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
cov_pooled
```

```
##             gene_x      gene_y
## gene_x  0.90394338 -0.04142668
## gene_y -0.04142668  0.99434430
```
Division doesn't work well with matrices, so to calculate the weights that define the linear boundary, we have to invert the pooled covariance matrix \(W\):
```r
cov_inv <- solve(cov_pooled)
```
and use matrix multiplication, `%*%`, to multiply the inverse matrix, \(W^{-1}\), by the difference between the two means, i.e., the distance between the classes. This is effectively an expression of

\[ \frac{d^2}{s^2_1 + s^2_2} \]

using \(W^{-1}\) instead of the sum of spreads to calculate the weight vector \(w\):

\[ w = W^{-1} d \]
```r
d <- (mu1 - mu2)                                  # distance between class means
w <- as.vector(cov_inv %*% d)                     # discriminant weights
b <- drop(-0.5 * t(mu1 + mu2) %*% cov_inv %*% d)  # intercept (drop() turns the 1x1 matrix into a scalar)
```
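With \(w\) and \(b\) in hand, the two-class rule reduces to the sign of a linear score: with equal priors, \(\delta_A(x) - \delta_B(x) = w^\top x + b\). A small sketch, where `x_new` is a hypothetical new observation (not drawn from the simulated data):

```r
# Classify a hypothetical new observation using the linear rule.
# score > 0 places x_new on the "A" side of the boundary, score < 0 on the "B" side
# (assuming equal priors for the two classes).
x_new <- c(gene_x = 3.0, gene_y = 7.0)
score <- drop(sum(w * x_new) + b)
if (score > 0) "A" else "B"
```

Since `x_new` sits close to genotype A's mean expression levels, the score should come out positive.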
Let's plot this boundary and see where the best place to split the classes is.
```r
# Filter out geno C data
lda_data_xy <- lda_data[lda_data$geno != "C", ]
# Create a grid to visualize the boundary
x_vals <- seq(min(lda_data_xy$gene_x), max(lda_data_xy$gene_x), length.out = 100)
y_vals <- (-w[1] * x_vals - as.numeric(b)) / w[2] # Solving for y
boundary <- data.frame(gene_x = x_vals, gene_y = as.vector(y_vals))

ggplot(lda_data_xy, aes(x = gene_x, y = gene_y, color = geno)) +
  geom_point() +
  geom_line(data = boundary, aes(x = x_vals, y = y_vals), color = "black", linetype = "dashed") +
  labs(title = "LDA Boundary for gene_x vs gene_y") +
  coord_fixed() +
  theme_bw()
```
Now let's project these onto a lower dimension. First, we should make sure our LDA vector is normalized, i.e., scaled to a unit vector:
```r
w_unit <- w / sqrt(sum(w^2))
```
Now, center the data about the global mean
```r
X <- as.matrix(lda_data_xy[, c("gene_x", "gene_y")])
# Center around the global mean
X_centered <- scale(X, center = TRUE, scale = FALSE)
```
and finally, project the data onto the discriminant axis.
```r
# Projection of each point onto the LDA axis
lda_proj <- X_centered %*% w_unit
lda_data_xy$lda_proj <- as.vector(lda_proj)
```
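As a quick sanity check (an addition to the walkthrough, not part of the original analysis), the one-dimensional scores should already separate the two genotypes: each class's mean score should sit well apart relative to the within-class spread.

```r
# Class means and spreads of the 1-D projected scores;
# the two means should be far apart relative to the sds
tapply(lda_data_xy$lda_proj, droplevels(lda_data_xy$geno), mean)
tapply(lda_data_xy$lda_proj, droplevels(lda_data_xy$geno), sd)
```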
Now let’s see this in context
```r
proj_coords <- X_centered %*% w_unit %*% t(w_unit)
global_mean <- colMeans(X)
proj_coords <- sweep(proj_coords, 2, global_mean, FUN = "+")

# Build a dataframe for plotting
proj_df <- data.frame(
  gene_x_proj = proj_coords[, 1],
  gene_y_proj = proj_coords[, 2],
  geno = lda_data_xy$geno
)

# Add the projected points to the plot
ggplot() +
  geom_point(data = lda_data_xy, aes(x = gene_x, y = gene_y, color = geno), alpha = 0.6, size = 2) +
  geom_point(data = proj_df, aes(x = gene_x_proj, y = gene_y_proj, color = geno), shape = 4, size = 2.5) +
  geom_segment(aes(x = lda_data_xy$gene_x, y = lda_data_xy$gene_y,
                   xend = proj_df$gene_x_proj, yend = proj_df$gene_y_proj,
                   color = lda_data_xy$geno),
               alpha = 0.4) +
  labs(title = "Projection of Points onto LDA Axis", x = "gene_x", y = "gene_y") +
  coord_fixed() +
  theme_bw()
```
And finally, let's see both the boundary and the discriminant axis together.
```r
ggplot() +
  geom_point(data = lda_data_xy, aes(x = gene_x, y = gene_y, color = geno), alpha = 0.6, size = 2) +
  geom_point(data = proj_df, aes(x = gene_x_proj, y = gene_y_proj, color = geno), shape = 4, size = 2.5) +
  geom_segment(aes(x = lda_data_xy$gene_x, y = lda_data_xy$gene_y,
                   xend = proj_df$gene_x_proj, yend = proj_df$gene_y_proj,
                   color = lda_data_xy$geno),
               alpha = 0.4) +
  geom_line(data = boundary, aes(x = x_vals, y = y_vals), color = "black", linetype = "dashed") +
  labs(title = "LDA Boundary and Axis", x = "gene_x", y = "gene_y") +
  coord_fixed() +
  theme_bw()
```
And we can see that the boundary line that divides the groups is perpendicular to the LDA axis that shows the direction of separation.
So you might be wondering, if \(d\) is the difference between the means of each category, how do we use this formula with more than 2 categories? We can’t just use the distance between any two.
Well, we use a central point (C) that is equidistant from the mean of each class. So instead of maximizing the distance between classes, we try to maximize the distance of each class from C while still minimizing the scatter within each category.
Instead of
\[ \frac{(d)^2}{s^2_1 + s^2_2} \]
we use
\[ \frac{d_1^2 + d_2^2 + ... + d_n^2}{s_1^2 + s_2^2 + ... + s_n^2} \]
where

- \(n\) is the number of classes,
- \(d_n^2\) is the squared distance between a given class mean and the central point, C, and
- \(s_n^2\) is the spread of a given class.
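This multi-class ratio is exactly what the eigen-decomposition of \(W^{-1}B\) from earlier maximizes. As a sketch (an illustrative by-hand computation on the simulated data, not part of the original analysis), we can build \(B\) and \(W\) directly and extract the discriminant directions:

```r
X3 <- as.matrix(lda_data[, c("gene_x", "gene_y", "gene_z")])
grand_mean <- colMeans(X3)

# Between-group (B) and within-group (W) sum-of-squares matrices
B <- matrix(0, 3, 3)
W <- matrix(0, 3, 3)
for (k in levels(lda_data$geno)) {
  Xk <- X3[lda_data$geno == k, , drop = FALSE]
  mk <- colMeans(Xk)
  B  <- B + nrow(Xk) * tcrossprod(mk - grand_mean)  # n_k (xbar_k - xbar)(xbar_k - xbar)'
  W  <- W + crossprod(sweep(Xk, 2, mk))             # sum of (x - xbar_k)(x - xbar_k)'
}

# Eigenvectors of W^-1 B span the discriminant space
ev <- eigen(solve(W) %*% B)
Re(ev$vectors[, 1:2])  # directions comparable (up to sign and scaling) to lda()'s LD1/LD2
```

With three classes, only the first two eigenvalues are non-zero, which is why the fitted model below reports exactly two linear discriminants.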
Let's use the function `lda()` from the {MASS} package to check out LDA with multiple classes.
```r
lda_model <- lda(geno ~ gene_x + gene_y + gene_z, data = lda_data)
# Check model output
print(lda_model)
```
```
## Call:
## lda(geno ~ gene_x + gene_y + gene_z, data = lda_data)
## 
## Prior probabilities of groups:
##         B         A         C 
## 0.2933333 0.3500000 0.3566667 
## 
## Group means:
##     gene_x   gene_y   gene_z
## B 5.095949 5.085835 4.974837
## A 3.027599 7.129011 5.118581
## C 1.083457 9.167254 7.087386
## 
## Coefficients of linear discriminants:
##               LD1        LD2
## gene_x -0.5976035  0.2789010
## gene_y  0.7212829 -0.2496987
## gene_z  0.3574859  0.9554625
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9711 0.0289
```
In the output we see:

- the prior probabilities, which are the proportions of observations in each class (estimated from the training data by default)
- the group means, which are the means of each gene within each genotype
- the coefficients of the linear discriminants, which represent the direction of separation and are the eigenvectors of \(W^{-1}B\). The higher the absolute value of a coefficient, the more discriminative power that variable has.
- the proportion of trace, which is the share of between-group variance explained by each discriminant (here LD1 accounts for about 97%)
These values inform the predictions made by projecting the data onto a lower-dimensional space. Luckily, {MASS} has the `predict()` function so we don't have to do it by hand again.
```r
lda_pred <- predict(lda_model, lda_data)
# The LDA projection scores for each observation
lda_data$LD1 <- lda_pred$x[, 1]
lda_data$LD2 <- lda_pred$x[, 2]
```
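`predict()` also returns the assigned class for each observation, so before plotting we can check how well the model recovers the simulated genotypes (a resubstitution check, which is optimistic because we are predicting on the training data itself):

```r
# Resubstitution confusion matrix and overall accuracy
table(observed = lda_data$geno, predicted = lda_pred$class)
mean(lda_pred$class == lda_data$geno)  # overall accuracy
```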
Now, let’s visualize this
```r
# Create a grid of points to evaluate the decision boundary
grid <- expand.grid(LD1 = seq(min(lda_data$LD1) - 1, max(lda_data$LD1) + 1, length.out = 100),
                    LD2 = seq(min(lda_data$LD2) - 1, max(lda_data$LD2) + 1, length.out = 100))
grid$gene_x <- (grid$LD1) # Placeholder for gene_x, gene_y, gene_z transformation
grid$gene_y <- (grid$LD2) # Same for gene_y
grid$gene_z <- 1 # If you want to include gene_z as a constant, modify as necessary.

# Predict the class for each point in the grid
grid$pred_geno <- predict(lda_model, newdata = grid)$class

# Plot with decision boundary
ggplot(lda_data, aes(x = LD1, y = LD2, color = geno)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "2D LDA Projection with Decision Boundary",
       x = "Linear Discriminant 1 (LD1)",
       y = "Linear Discriminant 2 (LD2)") +
  geom_contour(data = grid, aes(x = LD1, y = LD2, z = as.numeric(pred_geno)), color = "black", bins = 1) +
  theme_bw() +
  theme(aspect.ratio = 1, legend.title = element_blank())
```
Here, the line is squiggly because the grid uses the LD scores as placeholders for the original gene values rather than inverting the projection, so the drawn contour only approximates the true LDA boundary. The class structure might be easier to see by restructuring the plot; one common method for this in LDA is through ellipses:
```r
ell <- NULL
for (i in unique(lda_data$geno)) {
  x <- lda_data |> dplyr::filter(geno == i)
  e <- gen_xvar_ellipse(x[, 1:3], n = 300, nstd = 1)
  e$geno <- i
  ell <- bind_rows(ell, e)
}

lda1 <- ggplot(lda_data,
               aes(x = gene_x, y = gene_y, color = geno)) +
  geom_point() +
  scale_color_discrete() +
  ggtitle("(a)") +
  theme_bw() +
  theme(aspect.ratio = 1, legend.title = element_blank())

lda2 <- ggplot(ell,
               aes(x = gene_x, y = gene_y, color = geno)) +
  geom_point() +
  scale_color_discrete() +
  ggtitle("(b)") +
  theme_bw() +
  theme(aspect.ratio = 1, legend.title = element_blank())

ggarrange(lda1, lda2, ncol = 2, common.legend = TRUE, legend = "bottom")
```
```r
# Create a 3D scatter plot of the class ellipses across the three gene dimensions
fig <- plot_ly(data = ell,
               x = ~gene_x,
               y = ~gene_y,
               z = ~gene_z,
               color = ~geno,
               type = 'scatter3d',
               mode = 'markers',
               marker = list(size = 5)) %>%
  layout(title = "3D View of Class Ellipses",
         scene = list(xaxis = list(title = 'gene_x'),
                      yaxis = list(title = 'gene_y'),
                      zaxis = list(title = 'gene_z')))
fig
```
Quadratic discriminant analysis is similar to LDA, but assumes that each class has its own unique covariance matrix. QDA can be thought of as an extension of LDA, as it still classifies observations from a set of predictor variables. Since each class has its own covariance matrix, QDA allows for more complex, non-linear (quadratic) decision boundaries.
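Before turning to real data, a quick sketch on the simulated genotype data from above: `qda()` from {MASS} mirrors the `lda()` interface, so fitting and predicting look the same. Since that data was generated with equal within-class spreads, QDA should behave much like LDA here.

```r
# Fit QDA on the simulated genotype data; qda() mirrors the lda() interface
qda_model <- MASS::qda(geno ~ gene_x + gene_y + gene_z, data = lda_data)
qda_pred  <- predict(qda_model, lda_data)
table(observed = lda_data$geno, predicted = qda_pred$class)
```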
First things first, we’ll load in the data that we need and clean it up to get that out of the way.
```r
# store data from Kamilar and Cooper in holder f
f <- curl("https://raw.githubusercontent.com/bygentry/AN588_Vignette/refs/heads/main/KamilarAndCooperData.csv")
# convert data in f to a data frame and store in data frame d
d <- read.csv(f, sep = ",", stringsAsFactors = TRUE)
# Using complete.cases() from the built-in stats package, we can remove rows with NA in specific columns
tidy.data <- d[complete.cases(d[, c(5, 8:10)]), ]
```